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Abstract 

In this paper, we propose a new method remMap — REgularized Multi- 
variate regression for identifying MAster Predictors — for fitting multivariate 
response regression models under the high-dimension-low-sample-size setting. 
remMap is motivated by investigating the regulatory relationships among differ- 
ent biological molecules based on multiple types of high dimensional genomic 
data. Particularly, we are interested in studying the influence of DNA copy 
number alterations on RNA transcript levels. For this purpose, we model the 
dependence of the RNA expression levels on DNA copy numbers through multi- 
variate linear regressions and utilize proper regularization to deal with the high 
dimensionality as well as to incorporate desired network structures. Criteria 
for selecting the tuning parameters are also discussed. The performance of the 
proposed method is illustrated through extensive simulation studies. Finally, 
remMap is applied to a breast cancer study, in which genome wide RNA tran- 
script levels and DNA copy numbers were measured for 172 tumor samples. We 
identify a trans- hub region in cytoband 17ql2-q21, whose amplification influ- 
ences the RNA expression levels of more than 30 unlinked genes. These findings 
may lead to a better understanding of breast cancer pathology. 

Key words: sparse regression, MAP (MAster Predictor) penalty, DNA copy num- 
ber alteration, RNA transcript level, v-fold cross validation. 



1 Introduction 

In a few recent breast cancer cohort studies, microarray expression experiments 
and array CGH (comparative genomic hybridization) experiments have been con- 
ducted for more than 170 primary breast tumor specimens collected at multiple can- 
cer centers flSorlie et al. 20011 ISorlie et al.^003l IZhao et al. 20041 |Kapp et al. ^006 
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Bergamaschi et al. 2006; Langerod et al. 2007 Bergamaschi et al. 2008). The re- 



sulting RNA transcript levels (from microarray expression experiments) and DNA 
copy numbers (from CGH experiments) of about 20K genes/clones across all the tu- 
mor samples were then used to identify useful molecular markers for potential clinical 
usage. While useful information has been revealed by analyzing expression arrays 
alone or CGH arrays alone, careful integrative analysis of DNA copy numbers and 
expression data are necessary as these two types of data provide complimentary in- 
formation in gene characterization. Specifically, RNA data give information on genes 
that are over/under-expressed, but do not distinguish primary changes driving can- 
cer from secondary changes resulting from cancer, such as proliferation rates and 
differentiation state. On the other hand, DNA data give information on gains and 
losses that are drivers of cancer. Therefore, integrating DNA and RNA data helps 
to discern more subtle (yet biologically important) genetic regulatory relationships in 
cancer cells (IPoUack et al. 2002^ . 

It is widely agreed that variations in gene copy numbers play an important 
role in cancer development through altering the expression levels of cancer-related 
genes (lAlbertson et al. 20030 . This is clear for cis-regulations, in which a gene's DNA 
copy number alteration influences its own RNA transcript level (Hyman et al. 2002 



IPoUack et al. 20021) . However, DNA copy number alterations can also alter in trans 
the RNA transcript levels of genes from unlinked regions, for example by directly al- 
tering the copy number and expression of transcriptional regulators, or by indirectly 
altering the expression or activity of transcriptional regulators, or through genome re- 
arrangements affecting cis-regulatory elements. The functional consequences of such 
trans-regulations are much harder to establish, as such inquiries involve assessment of 
a large number of potential regulatory relationships. Therefore, to refine our under- 
standing of how these genome events exert their effects, we need new analytical tools 
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that can reveal the subtle and complicated interactions among DNA copy numbers 
and RNA transcript levels. Knowledge resulting from such analysis will help shed 
light on cancer mechanisms. 

The most straightforward way to model the dependence of RNA levels on DNA 
copy numbers is through a multivariate response linear regression model with the 
RNA levels being responses and the DNA copy numbers being predictors. While the 
multivariate linear regression is well studied in statistical literature, the current prob- 
lem bears new challenges due to (i) high- dimensionality in terms of both predictors 
and responses; (ii) the interest in identifying master regulators in genetic regulatory 
networks; and (iii) the complicated relationships among response variables. Thus, the 
naive approach of regressing each response onto the predictors separately is unlikely to 
produce satisfactory results, as such methods often lead to high variability and over- 



fitting. This has been observed by many authors, for example, Breiman et al. (1997) 



show that taking into account of the relation among response variables helps to im- 



prove the overall prediction accuracy. More recently, Kim et al. (2008) propose a 
new statistical framework to explicitly incorporate the relationships among responses 
by assuming the linked responses depend on the predictors in a similar way. The 
authors show that this approach helps to select relevant predictors when the above 
assumption holds. 

When the number of predictors is moderate or large, model selection is often 
needed for prediction accuracy and/or model interpretation. Standard model selection 
tools in multiple regression such as AIC and forward stepwise selection have been ex- 
tended to multivariate linear regression models (IBedrick et al. 1994} Fujikoshi et al. 1997 



ILutz and Biihlmann 2006|) . More recently, sparse regularization schemes have been 
utilized for model selection under the high dimensional multivariate regression setting. 



For example, Turlach et al. (2005) propose to constrain the coefficient matrix of a 
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multivariate regression model to lie within a suitable polyhedral region. Lutz and Biihlmann (2006) 



propose an L2 multivariate boosting procedure. Obozinskiy et al. (2008) propose to 
use a £1/^2 regularization to identify the union support set in the multivariate regres- 
sion. Moreover, Brown et al. (1998, 1999, 2002) introduce a Bayesian framework to 
model the relation among the response variables when performing variable selection 
for multivariate regression. Another way to reduce the dimensionality is through fac- 



tor analysis. Related work includes Izenman (1975) , Frank et al. (1993) , Reinsel and Velu (1998) , Yuan et 
and many others. 

For the problem we are interested in here, the dimensions of both predictors and 
responses are large (compared to the sample size). Thus in addition to assuming 
that only a subset of predictors enter the model, it is also reasonable to assume 
that a predictor may affect only some but not all responses. Moreover, in many 
real applications, there often exist a subset of predictors which are more impor- 
tant than other predictors in terms of model building and/or scientific interest. For 
example, it is widely believed that genetic regulatory relationships are intrinsically 
sparse dJeong et al. 20011 IGardner et al. 20"03l) . At the same time, there exist master 
regulators — network components that affect many other components, which play im- 
portant roles in shaping the network functionality. Most methods mentioned above do 
not take into account the dimensionality of the responses, and thus a predictor/factor 



influences either all or none responses, e.g., Turlach et al. (2005), Yuan et al. (2007) 



the L2 row boosting by Lutz and Biihlmann (2006), and the ^1/^2 regularization by 



Obozinskiy et al. (2008) On the other hand, other methods only impose a sparse 



model, but do not aim at selecting a subset of predictors, e.g., the L2 boosting by 



Lutz and Biihlmann (2006) In this paper, we propose a novel method remMap 



REgularized Multivariate regression for identifying MAster Predictors, which takes 
into account both aspects. remMap uses an ii norm penalty to control the over- 
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all sparsity of the coefficient matrix of the multivariate linear regression model. In 
addition, remMap imposes a "group" sparse penalty, which in essence is the same 



as the "group lasso" penalty proposed by Bakin (1999), Antoniadis and Fan (2001) 



Yuan and Lin (2006) , Zhao et al. (2006) and Obozinskiy et al. (2008) (see more dis- 



cussions in Section [2]). This penalty puts a constraint on the £2 norm of regression 
coefficients for each predictor, which controls the total number of predictors entering 
the model, and consequently facilitates the detection of master predictors. The per- 
formance of the proposed method is illustrated through extensive simulation studies. 

We apply the remMap method on the breast cancer data set mentioned earlier and 
identify a significant trans-hub region in cytoband 17ql2-q21, whose amplification 
influences the RNA levels of more than 30 unlinked genes. These findings may shed 
some light on breast cancer pathology. We also want to point out that analyzing 
CGH arrays and expression arrays together reveals only a small portion of the regu- 
latory relationships among genes. However, it should identify many of the important 
relationships, i.e., those reflecting primary genetic alterations that drive cancer de- 
velopment and progression. While there are other mechanisms to alter the expression 
of master regulators, for example by DNA mutation or methylation, in most cases 
one should also find corresponding DNA copy number changes in at least a subset 
of cancer cases. Nevertheless, because we only identify the subset explainable by 
copy number alterations, the words "regulatory network" ("master regulator") used 
in this paper will specifically refer to the subnetwork (hubs of the subnetwork) whose 
functions change with DNA copy number alterations, and thus can be detected by 
analyzing CGH arrays together with expression arrays. 

The rest of the paper is organized as follows. In Section 2, we describe the remMap 
model, its implementation and criteria for tuning. In Section 3, the performance of 
remMap is examined through extensive simulation studies. In Section 4, we apply the 
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remMap method on the breast cancer data set. We conclude the paper with discussions 
in Section 5. Technical details are provided in the supplementary material. 



2 Method 



2.1 Model 



Consider multivariate regression with Q response variables ■ ■ ■ , and P predic- 
tion variables Xi, ■ ■ ■ , Xp: 



where the error terms ei, ■ ■ ■ , eg have a joint distribution with mean and covariance 
Se- The primary goal of this paper is to identify non-zero entries in the P x Q 
coefficient matrix B = {(3pq) based on i.i.d samples from the above model. Under 
normality assumptions, (3pq can be interpreted as proportional to the conditional 
correlation Cor(yg, where := {xp' : I < p' p < P}- In the following, 

we use Yq = {y^, ■ ■ ■ , y^)^ and Xp = {Xp, ■ ■ ■ ,Xp)'^ to denote the sample of the q*^ 
response variable and that of the p*'^ prediction variable, respectively. We also use 
Y = {Yi : ■ ■ ■ : Yq) to denote the N xQ response matrix, and use X = (Xi : ■ ■ ■ : Xp) 
to denote the N x P prediction matrix. 

In this paper, we shall focus on the cases where both Q and P are larger than 
the sample size N. For example, in the breast cancer study discussed in Section |H 
the sample size is 172, while the number of genes and the number of chromosomal 
regions are on the order of a couple of hundreds (after pre-screening). When P > N, 
the ordinary least square solution is not unique, and regularization becomes indis- 
pensable. The choice of suitable regularization depends heavily on the type of data 



p 




(1) 
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structure we envision. In recent years, ^i-norm based sparsity constraints such as lasso 
( ITibshirani 1996|l have been widely used under such high-dimension-low-sample-size 



settings. This kind of regularization is particularly suitable for the study of genetic 
pathways, since genetic regulatory relationships are widely believed to be intrinsically 
sparse (Jeong et al. 2001 [Gardner et al. 2003p . In this paper, we impose an ii norm 



penalty on the coefficient matrix B to control the overall sparsity of the multivariate 
regression model. In addition, we put constraints on the total number of predictors 
entering the model. This is achieved by treating the coefficients corresponding to 
the same predictor (one row of B) as a group, and then penalizing their £2 norm. A 
predictor will not be selected into the model if the corresponding £2 norm is shrunken 
to 0. Thus this penalty facilitates the identification of master predictors — predic- 
tors which affect (relatively) many response variables. This idea is motivated by the 
fact that master regulators exist and are of great interest in the study of many real 
life networks including genetic regulatory networks. Specifically, for model ([I]), we 
propose the following criterion 

^ p p p 

L(B; Al, A2) = -||Y — ''^XpBpWp + Ai ^ \ \Cp ■ Bp\\i + A2 ^ \ \Cp ■ Bp\\2, (2) 

p=i p=i p=i 

where Cp is the pth row of C = (cpg) = {Cf : ■ ■ ■ : Cp)'^, which is a pre-specified 
P X Q 0-1 matrix indicating the coefficients on which penalization is imposed; Bp is 
the p*'* row of B; || ■ ||f denotes the Frobenius norm of matrices; || ■ ||i and || ■ II2 are 
the £1 and £2 norms for vectors, respectively; and "■" stands for Hadamard product 
(that is, entry- wise multiplication). The indicator matrix C is pre-specified based on 
prior knowledge: if we know in advance that predictor Xp affects response yq, then 
the corresponding regression coefficient /3pg will not be penalized and we set — 
(see Section m for an example). When there is no such prior information, C can be 
simply set to a constant matrix Cpg = 1. Finally, an estimate of the coefficient matrix 
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B is B(Ai, A2) := argminB L{B; Ai, A2). 

In the above criterion function, the ii penalty induces the overall sparsity of 
the coefficient matrix B. The £2 penalty on the row vectors Cp ■ Bp induces row 
sparsity of the product matrix C ■ B. As a result, some rows are shrunken to be 
entirely zero (Theorem [1]). Consequently, predictors which affect relatively more 
response variables are more likely to be selected into the model. We refer to the 
combined penalty in equation ([2]) as the MAP (MAster Predictor) penalty. We also 
refer to the proposed estimator B(Ai,A2) as the remMap (REgularized Multivariate 
regression for identifying MAster Predictors) estimator. Note that, the £2 penalty is 
a special case (with a = 2) of the more general penalty form: J2p=i I ' U' where 
IblU •= (Z]?=i l"^?!")" fo^^ ^ vector v E TZ^ and a > 1. In Turlach et al. (2005), a 



penalty with a = 00 is used to select a common subset of prediction variables when 
modeling multivariate responses. In Yuan et al. (2007), a constraint with a = 2 
is applied to the loading matrix in a multivariate linear factor regression model for 



dimension reduction. In Obozinskiy et al. (2008) , the same constraint is applied to 
identify the union support set in the multivariate regression. In the case of multiple 



regression, a similar penalty corresponding to a = 2 is proposed by Bakin (1999) and 



by Yuan and Lin (2006) for the selection of grouped variables, which corresponds to 



the blockwise additive penalty in Antoniadis and Fan (2001) for wavelet shrinkage. 



Zhao et al. (2006) propose the penalty with a general a > 1. However, none of these 



methods takes into account the high dimensionality of response variables and thus 
predictors/factors are simultaneously selected for all responses. On the other hand, 
by combining the £2 penalty and the £1 penalty together in the MAP penalty, the 
remMap model not only selects a subset of predictors, but also limits the influence 
of the selected predictors to only some (but not necessarily all) response variables. 
Thus, it is more suitable for the cases when both the number of predictors and the 
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number of responses are large. Lastly, we also want to point out a difference between 



the MAP penalty and the ElasticNet penalty proposed by Zou et al. (2005), which 
combines the ^l norm penalty with the squared £2 norm penalty. The ElasticNet 
penalty aims to encourage a group selection effect for highly correlated predictors 
under the multiple regression setting. However, the squared £2 norm itself does not 
induce sparsity and thus is intrinsically different from the I2 norm penalty discussed 
above. 

In Section [3l we use extensive simulation studies to illustrate the effects of the 
MAP penalty. We compare the remMap method with two alternatives: (i) the joint 
method which only utilizes the li penalty, that is A2 = in ([2]); (ii) the sep method 
which performs Q separate lasso regressions. We find that, if there exist large hubs 
(master predictors), remMap performs much better than joint in terms of identifying 
the true model; otherwise, the two methods perform similarly. This suggests that the 
"simultaneous" variable selection enhanced by the £2 penalty pays off when there exist 
a small subset of "important" predictors, and it costs little when such predictors are 
absent. In addition, both remMap and joint methods impose sparsity of the coeffi- 
cient matrix as a whole. This helps to borrow information across different regressions 
corresponding to different response variables, and thus incorporates the relationships 
among response variables into the model. It also amounts to a greater degree of reg- 
ularization, which is usually desirable for the high-dimension-low-sample-size setting. 
On the other hand, the sep method controls sparsity for each individual regression 
separately and thus is subject to high variability and over-fitting. As can be seen by 
the simulation studies (Section [3]), this type of "joint" modeling greatly improves the 



model efficiency. This is also noted by other authors including Turlach et al. (2005) 



Lutz and Biihlmann (2006) and Obozinskiy et al. (2008) 
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2.2 Model Fitting 

In this section, we propose an iterative algorithm for solving the remMap estimator 
B(Ai,A2). This is a convex optimization problem when the two tuning parameters 
are not both zero, and thus there exists a unique solution. We first describe how to 
update one row of B, when all other rows are fixed. 

Theorem 1 Given {-Bpjp^pQ in the solution for miuBp^ L(B; Xi, X2) is given by 
Bpo = 0po,i, ■ ■ ■ , J^po,q) which satisfies: fori <q<Q 

(i) Ifc,,,, = 0, A,o,, = X^^YJWXJII (OLS), where Y, = Y,- E,^^^ X,P„; 

(ii) IfCp^,q = 1, 

' 0, z/ ||fip^-°||2,c = 0; 



l-^s^ — „^ „2 I Otherwise 



where 



lasso I 



\2,C 



1/2 



and 



The proof of Theorem 1 is given in the supplementary material (Appendix A). 

Theorem [1] says that, when estimating the p^^ row of the coefficient matrix B with 
all other rows fixed: if there is a pre-specified relationship between the predictor 
and the q^^ response (i.e., Cp^^g = 0), the corresponding coefficient /Jp^ g is estimated 
by the (univariate) ordinary least square solution (OLS) using current responses Y^; 
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otherwise, we first obtain the lasso solution P^^^q by the (univariate) soft shrinkage 
of the OLS solution (equation (jlj)), and then conduct a group shrinkage of the lasso 
solution (equation ([3])). From Theorem [H it is easy to see that, when the design 
matrix X is orthonormal: X-^X = Ip and Ai = 0, the remMap method amounts to 
selecting variables according to the £2 norm of their corresponding OLS estimates. 

Theorem [T] naturally leads to an algorithm which updates the rows of B iter- 
atively until convergence. In particular, we adopt the active-shooting idea pro- 



posed by Peng et al. (2008) and Friedman et al. (2008) , which is a modification of 



the shooting algorithm proposed by Fu (1998) and also Friedman et al. (2007) 
among others. The algorithm proceeds as follows: 



1. Initial step: for p = 1, P; q = 1, Q, 



X^YjWXpWl 



sign 



II Y P" 



if c 

if c. 



p,<} 



p,<i 



0; 
1. 



(5) 



2. Define the current active-row set A = {p : current ||-Bp||2,c 7^ 0}. 

(2.1) For each p e A, update Bp with all other rows of B fixed at their current 
values according to Theorem [H 

(2.2) Repeat (2.1) until convergence is achieved on the current active-row set. 

3. For p = 1 to P, update Bp with all other rows of B fixed at their current values 
according to Theorem [H If no Bp changes during this process, return the current 
B as the final estimate. Otherwise, go back to step 2. 

It is clear that the computational cost of the above algorithm is in the order of 
OiNPQ). 
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2.3 Tuning 

In this section, we discuss the selection of the tuning parameters (Ai,A2) by v-fold 
cross vahdation. To perform the v-fold cross validation, we first partition the whole 
data set into V non-overlapping subsets, each consisting of approximately 1/V frac- 
tion of total samples. Denote the z*^' subset as -D*^*^ = (Y^^^X*^*)), and its comple- 
ment as = (Y-«,X-(*)). For a given (Ai, A2), we obtain the remMap estimate: 
B*^*^(Ai,A2) = iPpq) based on the i^^ training set D~^^\ We then obtain the ordi- 
nary least square estimates B^jg(Ai,A2) = {(^oispq) follows: for 1 < g < Q, define 
Sq = {p : 1 < p < P, (3pq 7^ 0}. Then set (3^]!.pg = ii p ^ Sg] otherwise, define 
{(^ois pq '■ P ^ Sq} a,s the ordinary least square estimates by regressing Yg~^'^ onto 
{Xp^^^ : p G Sq}. Finally, prediction error is calculated on the test set D^"^^: 

remMap. cv,(Ai,A2) := ||Y« - X»B«(Ai, A2)||^. (6) 
The v-fold cross validation score is then defined as 

V 

remMap . cv(Ai, A2) = remMap . cv.(Ai, A2). (7) 
The reason for using OLS estimates in calculating the prediction error is because 



the true model is assumed to be sparse. As noted by Efron et al. (2004) , when there 
are many noise variables, using shrunken estimates in the cross validation criterion 
often results in over fitting. Similar results are observed in our simulation studies: if 
in ([6]) and (I7j), the shrunken estimates are used, the selected models are all very big 
which result in large numbers of false positive findings. In addition, we also try AIC 
and GOV for tuning and both criteria result in over fitting as well. These results are 
not reported in the next section due to space limitation. 

In order to further control the false positive findings, we propose a method called 
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cv.vote. The idea is to treat the training data from each cross- vahdation fold as 
a "bootstrap" sample. Then variables being consistently selected by many cross 
validation folds should be more likely to appear in the true model than the variables 
being selected only by few cross validation folds. Specifically, for 1 < p < P and 
1 < < define 

, , I 1, if Er=i^(S?(Ai,A2)^0)>K; 

Spq{XuX2) = < (8) 

I 0, otherwise. 

where is a pre-specified integer. We then select edge {p,q) if Spq(Ai,A2) = 1. In 
the next section, we use = 5 and thus cv.vote amounts to a "majority vote" 
procedure. Simulation studies in Section [3] suggest that, cv.vote can effectively 
decrease the number of false positive findings while only slightly increase the number 
of false negatives. 

An alternative tuning method is by a BIG criterion. Compared to v-fold cross 
validation, BIG is computationally cheaper. However it requires much more assump- 
tions. In particular, the BIG method uses the degrees of freedom of each remMap 
model which is difficult to estimate in general. In the supplementary material, we 
derive an unbiased estimator for the degrees of freedom of the remMap models when 
the predictor matrix X has orthogonal columns (Theorem 2 of Appendix B in the 
supplementary materials). In Section [3l we show by extensive simulation studies that, 
when the correlations among the predictors are complicated, this estimator tends to 
select very small models. For more details see the supplementary material. Appendix 
B. 
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3 Simulation 



In this section, we investigate the performance of the remMap model and compare it 
with two alternatives: (i) the joint model with A2 = in ([2]); (ii) the sep model 
which performs Q separate lasso regressions. For each model, we consider three tuning 
strategies, which results in nine methods in total: 

1. remMap. cv, joint, cv, sep.cv: The tuning parameters are selected through 
10-fold cross validation; 

2. remMap. cv. vote, joint . cv. vote, sep.cv. vote: The cv.vote procedure 
with = 5 is applied to the models resulted from the corresponding * . cv 
approaches; 

3. remMap. bic, joint. bic, sep.bic: The tuning parameters are selected 
by a BIC criterion. For remMap. bic and joint. bic, the degrees of freedom 
are estimated according to equation ( 1S-6I) in Appendix B of the supplementary 
material; for sep.bic, the degrees of freedom of each regression is estimated by 
the total number of selected predictors (IZou et al. 20071) . 

We simulate data as follows. Given (A^, P, Q) , we first generate the predictors 
{xi, ■ ■ ■ , Xp)'^ ~ Normal p{0, Sx); where Sx is the predictor covariance matrix (for 
simulations 1 and 2, 5]x(p,p') := plf ^'). Next, we simulate a. P x Q 0-1 adjacency 
matrix A, which specifies the topology of the network between predictors and re- 
sponses, with A{p, q) = 1 meaning that Xp influences i/q, or equivalently Ppg 7^ 0. In 
all simulations, we set P = Q and the diagonals of A equal to one, which is viewed 
as prior information (thus the diagonals of C are set to zero). This aims to mimic 
cis-regulations of DNA copy number alternations on its own expression levels. We 
then simulate the P x Q regression coefficient matrix B = (/3pg) by setting jSpg = 0, 
if A(p, g) = 0; and Ppg ~ Uniform([-5, -1] U [1,5]), if A{p,q) = 1. After that, we 
generate the residuals (ei, ■ ■ ■ ,^q)'^ ~ NormalQ{0,'S^), where Yl^{q,q') = a^p^e 
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The residual variance cr^ is chosen such that the average signal to noise ratio equals 
to a pre-specified level s. Finally, the responses ■ ■ ■ , yq)'^ are generated according 
to model ([1]). Each data set consists of i.i.d samples of such generated predictors 
and responses. For all methods, predictors and responses are standardized to have 
(sample) mean zero and standard deviation one before model fitting. Results reported 
for each simulation setting are averaged over 25 independent data sets. 

For all simulation settings, C = (cpg) is taken to be Cpg = 0, if p = q; and 
Cpq = 1, otherwise. Our primary goal is to identify the trans-edges — the predictor- 
response pairs {xp,yq) with A{p,q) = 1 and C{p,q) = 1, i.e., the edges that are not 
pre-specified by the indicator matrix C. Thus, in the following, we report the number 
of false positive detections of trans-edges (FP) and the number of false negative 
detections of trans-edges (FN) for each method. We also examine these methods in 
terms of predictor selection. Specifically, a predictor is called a cis-predictor if it 
does not have any trans-edges; otherwise it is called a trans-predictor. Moreover, 
we say a false positive trans-predictor (FPP) occurs if a cis-predictor is incorrectly 
identified as a trans-predictor; we say a false negative trans-predictor (FNP) occurs 
if it is the other way around. 

Simulation I 

We first assess the performances of the nine methods under various combinations of 
model parameters. Specifically, we consider: P = Q = 400, 600, 800; s = 0.25, 0.5, 0.75; 
Px = 0, 0.4, 0.8; and = 0, 0.4, 0.8. For all settings, the sample size N is fixed at 200. 
The networks (adjacency matrices A) are generated with 5 master predictors (hubs), 
each influencing 20 ~ 40 responses; and all other predictors are cis-predictors. 
We set the total number of tran-edges to be 132 for all networks. Results on 
trans-edge detection are summarized in Figures [1] and [2l From these figures, it is 
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clear that remMap.cv and remMap. cv.vote perform the best in terms of the total 
number of false detections (FP+FN), followed by remMap. bic. The three sep meth- 
ods result in too many false positives (especially sep.cv). This is expected since 
there are in total Q tuning parameters selected separately, and the relations among 
responses are not utilized at all. This leads to high variability and over-fitting. The 
three joint methods perform reasonably well, though they have considerably larger 
number of false negative detections compared to remMap methods. This is because 
the joint methods incorporate less information about the relations among the re- 
sponses caused by the master predictors. Finally, comparing cv.vote to cv, we can 
see that the cv.vote procedure effectively decreases the false positive detections and 
only slightly inflates the false negative counts. 

As to the impact of different model parameters, signal size s plays an important 
role for all methods: the larger the signal size, the better these methods perform 



(Figure 1(a)). Dimensionality {P,Q) also shows consistent impacts on these meth- 



ods: the larger the dimension, the more false negative detections (Figure 1(b)). With 
increasing predictor correlation p^., both remMAP.bic and joint. bic tend to select 
smaller models, and consequently result in less false positives and more false nega- 



tives (Figure 2(a)). This is because when the design matrix X is further away from 
orthogonality, (1S-6P tends to overestimate the degrees of freedom and consequently 
smaller models are selected. The residual correlation seems to have little impact 



on joint and sep, and some (though rather small) impacts on remMap (Figure 2(b) ). 
Moreover, remMap performs much better than joint and sep on master predictor se- 
lection, especially in terms of the number of false positive trans-predictors (results 
not shown). This is because the £2 norm penalty is more effective than the £1 norm 
penalty in excluding irrelevant predictors. 
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Simulation II 

In this simulation, we study the performance of these methods on a network without 
big hubs. The data are generated similarly as before with P = Q = 600, = 200, 
s = 0.25, Px = 0.4, and = 0. The network consists of 540 cis-predictors, and 
60 trans-predictors with 1 ~ 4 trans-edges. This leads to 151 trans-edges in 
total. As can be seen from Tabled], remMap methods and joint methods now perform 
very similarly and both are considerably better than the sep methods. Indeed, under 
this setting, A2 is selected (either by cv or bic) to be small in the remMap model, 
making it very close to the joint model. 



Table 1: Simulation II. Network topology: uniform network with 151 trans-edges 
and 60 trans-predictors. P = Q = 600, N = 200; s = 0.25; p,^ = 0.4; p, = 0. 



Method 


FP FN TF 


FPP FNP 


remMap. bic 
remMap . cv 
remMap. cv. vote 


4.72(2.81) 45.88(4.5) 50.6(4.22) 
18.32(11.45) 40.56(5.35) 58.88(9.01) 
2.8(2.92) 50.32(5.38) 53.12(3.94) 


1.36(1.63) 11(1.94) 
6.52(5.07) 9.2(2) 
0.88(1.26) 12.08(1.89) 


joint .bic 
joint . cv 
joint . cv. vote 


5.04(2.68) 52.92(3.6) 57.96(4.32) 
16.96(10.26) 46.6(5.33) 63.56(7.93) 
2.8(2.88) 56.28(5.35) 59.08(4.04) 


4.72(2.64) 9.52(1.66) 
15.36(8.84) 7.64(2.12) 
2.64(2.92) 10.40(2.08) 


sep. bic 
sep . cv 
sep . cv . vote 


78.92(8.99) 37.44(3.99) 116.36(9.15) 
240.48(29.93) 32.4(3.89) 272.88(30.18) 
171.00(20.46) 33.04(3.89) 204.04(20.99) 


67.2(8.38) 5.12(1.72) 
179.12(18.48) 2.96(1.51) 
134.24(14.7) 3.6(1.50) 



FP: false positive; FN: false negative; TF: total false; FPP: false positive trans-predictor; 
FNP: false negative trans-predictor. Numbers in the parentheses are standard deviations 



Simulation III 

In this simulation, we try to mimic the true predictor covariance and network topol- 
ogy in the real data discussed in the next section. We observe that, for chromoso- 
mal regions on the same chromosome, the corresponding copy numbers are usually 
positively correlated, and the magnitude of the correlation decays slowly with ge- 
netic distance. On the other hand, if two regions are on different chromosomes, the 
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correlation between their copy numbers could be either positive or negative and in 
general the magnitude is much smaller than that of the regions on the same chro- 
mosome. Thus in this simulation, we first partition the P predictors into 23 distinct 
blocks, with the size of the i'^'^ block proportional to the number of CNAI (copy num- 
ber alteration intervals) on the i^^ chromosome of the real data (see Section H] for 
the definition of CNAI). Denote the predictors within the i*^ block 
where Qi is the size of the i*^ block. We then define the within- block correlation as: 
CoTT{xij,Xii) = p^b"'"'' 1 ^ J7 ^ ^ 9u and define the between-block correlation as 
Con {xij^,Xki) = Pik ioT 1 < j < Qi, 1 < I < Qk and 1 < i 7^ A; < 23. Here, pik 
is determined in the following way: its sign is randomly generated from {—1,1}; its 
magnitude is randomly generated from {pbb, Pbb' ' ' ' ' Pbb)- ^^^^ simulation, we set 
p„b = 0.9, Pbb = 0.25 and use P = Q = 600, N = 200, s = 0.5, and p, = 0.4. The 
heatmaps of the (sample) correlation matrix of the predictors in the simulated data 
and that in the real data are given by Figure S-2 in the supplementary material. The 
network is generated with five large hub predictors each having 14 ~ 26 trans-edges; 
five small hub predictors each having 3 ~ 4 trans-edges; 20 predictors having 1 ~ 2 
trans-edges; and all other predictors being cis-predictors. 

The results are summarized in Table[2l Among the nine methods, remMap . cv . vote 
performs the best in terms of both edge detectiion and master predictor prediction. 
remMAP.bic and joint .bic result in very small models due to the complicated corre- 
lation structure among the predictors. While all three cross-validation based methods 
have large numbers of false positive findings, the three cv.vote methods have much 
reduced false positive counts and only slightly increased false negative counts. These 
findings again suggest that cv.vote is an effective procedure in controlling false pos- 
itive rates while not sacrificing too much in terms of power. 
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Table 2: Simulation III. Network topology: five large hubs and five small hubs 
with 151 trans-edges and 30 trans-predictors. P = Q = 600, = 200; s = 
0.5; Pwb = 0.9, pbb = 0.25; = 0.4. 



Method 


FP FN TF 


FPP FNP 


remMap.bic 
remMap . cv 
remMap.cv.vote 


0(0) 150.24(2.11) 150.24(2.11) 
93.48(31.1) 20.4(3.35) 113.88(30.33) 
48.04(17.85) 27.52(3.91) 75.56(17.67) 


0(0) 29.88(0.33) 
15.12(6.58) 3.88(1.76) 
9.16(4.13) 5.20(1.91) 


joint .bic 
joint . cv 
joint . cv. vote 


7.68(2.38) 104.16(3.02) 111.84(3.62) 
107.12(13.14) 39.04(3.56) 146.16(13.61) 
63.80(8.98) 47.44(3.90) 111.24(10.63) 


7(2.18) 10.72(1.31) 
66.92(8.88) 1.88(1.2) 
41.68(6.29) 2.88(1.30) 


sep.bic 
sep . cv 
sep. cv.vote 


104.96(10.63) 38.96(3.48) 143.92(11.76) 
105.36(11.51) 37.28(4.31) 142.64(12.26) 
84.04(10.47) 41.44(4.31) 125.48(12.37) 


64.84(6.29) 1.88(1.17) 
70.76(7.52) 1.92(1.08) 
57.76 (6.20) 2.4 (1.32) 



FP: false positive; FN: false negative; TF: total false; FPP: false positive trans-predictor; 
FNP: false negative trans-predictor. Numbers in the parentheses are standard deviations 



4 Real application 

In this section, we apply the proposed remMap method to the breast cancer study 
mentioned earlier. Our goal is to search for genome regions whose copy number 
alterations have significant impacts on RNA expression levels, especially on those of 
the unlinked genes, i.e., genes not falling into the same genome region. The findings 
resulting from this analysis may help to cast light on the complicated interactions 
among DNA copy numbers and RNA expression levels. 



4.1 Data preprocessing 

The 172 tumor samples were analyzed using cDNA expression microarray and CGH 



array experiments as described in Sorlie et al. (2001) , Sorlie et al. (2003) , Zhao et al. (2004) , Kapp et al. (i 



Bergamaschi et al. (2006) , Langerod et al. (2007) , and Bergamaschi et al. (2008 



In below, we outline the data preprocessing steps. More details are provided in the 
supplementary material (Appendix C). 

Each CGH array contains measurements {log2 ratios) on about 17 K mapped hu- 
man genes. A positive (negative) measurement suggests a possible copy number gain 
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(loss). After proper normalization, cghFLasso (Tibshirani and Wang 2008) is used to 
estimate the DNA copy numbers based on array outputs. Then, we derive copy num- 
ber alteration intervals (CNAIs) — basic CNA units (genome regions) in which genes 
tend to be amplified or deleted at the same time within one sample — by employing 



the Fixed-Order Clustering (FOC) method ( |Wang 2004] ) . In the end, for each CNAI 
in each sample, we calculate the mean value of the estimated copy numbers of the 
genes falling into this CNAI. This results in a 172 (samples) by 384 (CNAIs) numeric 
matrix. 

Each expression array contains measurements for about 18K mapped human 
genes. After global normalization for each array, we also standardize each gene's mea- 
surements across 172 samples to median= and MAD (median absolute deviation) 
= 1. Then we focus on a set of 654 breast cancer related genes, which is derived based 
on 7 published breast cancer gene lists (ISorlie et al. 20031 [van de Vijver et al. 2002 



Chang et al.^OOil [PmFet al. 2004|[Wang et al. 2"005| ISotiriou et al. 2006(ISaal et al. 200711 . 



This results in a 172 (samples) by 654 (genes) numeric matrix. 

When the copy number change of one CNAI affects the RNA level of an unlinked 
gene, there are two possibilities: (i) the copy number change directly affects the RNA 
level of the unlinked gene; (ii) the copy number change first affects the RNA level 
of an intermediate gene (either linked or unlinked), and then the RNA level of this 
intermediate gene affects that of the unlinked gene. Figure [3] gives an illustration of 
these two scenarios. In this study, we are more interested in finding the relationships 
of the first type. Therefore, we first characterize the interactions among RNA levels 
and then account for these relationships in our model so that we can better infer 
direct interactions. For this purpose, we apply the space (Sparse PArtial Correlation 
Estimation) method to search for associated RNA pairs through identifying non- 



zero partial correlations (Peng et al. 2008). The estimated (concentration) network 



21 



(referred to as Exp. Net. 664 hereafter) has in total 664 edges — 664 pairs of genes 
whose RNA levels significantly correlate with each other after accounting for the 
expression levels of other genes. 

Another important factor one needs to consider when studying breast cancer is 
the existence of distinct tumor subtypes. Population stratification due to these dis- 
tinct subtypes could confound our detection of associations between CNAIs and gene 
expressions. Therefore, we introduce a set of subtype indicator variables, which 
later on is used as additional predictors in the remMap model. Specifically, follow- 



ing Sorlie et al. (2003), we divide the 172 patients into 5 distinct groups based on 
their expression patterns. These groups correspond to the same 5 subtypes sug- 



gested by Sorhe et al. (2003) — Luminal Subtype A, Luminal Subtype B, ERBB2- 
overexpressing Subtype, Basal Subtype and Normal Breast-like Subtype. 



4.2 Interactions between CNAIs and RNA expressions 

We then apply the remMap method to study the interactions between CNAIs and RNA 
transcript levels. For each of the 654 breast cancer genes, we regress its expression 
level on three sets of predictors: (i) expression levels of other genes that are connected 
to the target gene (the current response variable) in Exp. Net. 664 ! (h) the five subtype 
indicator variables derived in the previous section; and (iii) the copy numbers of all 
384 CNAIs. We are interested in whether any unlinked CNAIs are selected into this 
regression model (i.e., the corresponding regression coefficients are non-zero). This 
suggests potential trans-regulations (trans-edges) between the selected CNAIs and 
the target gene expression. The coefficients of the linked CNAI of the target gene are 
not included in the MAP penalty (this corresponds to Cpg = 0, see Section 2 for details). 
This is because the DNA copy number changes of one gene often influence its own 
expression level, and we are less interested in this kind of cis-regulatory relationships 
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(cis-edges) here. Furthermore, based on Exp. Net. 664, penalties are imposed on 
the expression levels of connected genes either. In another word, we view the cis- 
regulations between CNAIs and their linked expression levels, as well as the inferred 
RNA interaction network as "prior knowledge" in our study. 

Note that, different response variables (gene expressions) now have different sets of 
predictors, as their neighborhoods in Exp. Net. 664 are different. However, the remMap 
model can still be fitted with a slight modification. The idea is to treat all CNAI 
(384 in total), all gene expressions (654 in total), as well as five subtype indicators 
as nominal predictors. Then, for each target gene, we force the coefficients of those 
gene expressions that do not link to it in Exp. Net. 664 to be zero. We can easily 
achieve this by setting those coefficients to zero without updating them throughout 
the iterative fitting procedure. 

We select tuning parameters (Ai, A2) in the remMap model through a 10- fold cross 
validation as described in Section [2.31 The optimal (Ai,A2) corresponding to the 
smallest CV score from a grid search is (355.1,266.7). The resulting model con- 
tains 56 trans-regulations in total. In order to further control false positive findings, 
we apply the cv.vote procedure with Va = 5, and filter away 13 out of these 56 
trans-edges which have not been consistently selected across different CV folds. 
The remaining 43 trans-edges correspond to three contiguous CNAIs on chromo- 
some 17 and 31 distinct (unlinked) RNAs. Figure |4] illustrates the topology of the 
estimated regulatory relationships. The detailed annotations of the three CNAIs and 
31 RNAs are provided in Table [3] and Table |H Moreover, the Pearson-correlations 
between the DNA copy numbers of CNAIs and the expression levels of the regu- 
lated genes/clones (including both cis-regulation and trans-regulation) across 
the 172 samples are reported in Table HI As expected, all the cis-regulations have 
much higher correlations than the potential trans-regulations. In addition, none of 
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Table 3: Genome locations of the three CNAIs having (estimated) trans-regulations. 



Index 


Cytoband 


Begin ^ 


End 1 


^ of clones^ 


^ of Trans- Reg^ 


1 


17ql2-17ql2 


34811630 


34811630 


1 


12 


2 


17ql2-17ql2 


34944071 


35154416 


9 


30 


3 


17q21.1-17q21.2 


35493689 


35699243 


7 


1 



1. Nucleotide position (bp). 

2. Number of genes/clones on the array falling into the CNAI. 

3. Number of unlinked genes whose expressions are estimated to be regulated by the CNAI. 



the subtype indicator variables is selected into the final model, which implies that 
the detected associations between copy numbers of CNAIs and gene expressions are 
unlikely due to the stratification of the five tumor subtypes. 

The three CNAIs being identified as trans-regulators sit closely on chromosome 
17, spanning from 348116306j> to 356992435j> and falling into cytoband 17ql2-q21.2. 
This region (referred to as CNAI-17ql2 hereafter) contains 24 known genes, including 
the famous breast cancer oncogene ERBB2, and the growth factor receptor-bound 
protein 7 (GRB7). The over expression of GRB7 plays pivotal roles in activating 
signal transduction and promoting tumor growth in breast cancer cells with chro- 
mosome 17qll-21 amplification (IBai and Louh 20080 . In this study, CNAI-17ql2 is 
highly amplified (normalized log2 ratio> 5) in 33 (19%) out of the 172 tumor samples. 
Among the 654 genes/clones considered in the above analysis, 8 clones (correspond- 
ing to six genes including ERBB2, GRB7, and MED24) fall into this region. The 
expressions of these 8 clones are all up-regulated by the amplification of CNAI-17ql2 
(see Table m for more details), which is consistent with results reported in the liter- 
ature (IKao and Pollack 20060 . More importantly, as suggested by the result of the 
remMap model, the amplification of CNAI-17ql2 also infiuences the expression levels of 
31 unlinked genes/clones. This implies that CNAI-17ql2 may harbor transcriptional 
factors whose activities closely relate to breast cancer. Indeed, there are 4 transcrip- 
tion factors (NEUR0D2, IKZF3, THRA, NRIDI) and 2 transcriptional co-activators 
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(MEDl, MED24) in CNAI-17ql2. It is possible that the amphfication of CNAI-17ql2 
results in the over expression of one or more transcription factors/co-activators in this 
region, which then influence the expressions of the unlinked 31 genes/clones. In ad- 
dition, some of the 31 genes/clones have been reported to have functions directly 
related to cancer and may serve as potential drug targets (see Appendix C.6 of the 
supplementary material for more details). In the end, we want to point out that, be- 
sides RNA interactions and subtype stratification, there could be other unaccounted 
confounding factors. Therefore, caution must be applied when one tries to interpret 
these results. 

5 Discussion 

In this paper, we propose the remMap method for fitting multivariate regression mod- 
els under the large P, Q setting. We focus on model selection, i.e., the identification of 
relevant predictors for each response variable. remMap is motivated by the rising needs 
to investigate the regulatory relationships between different biological molecules based 
on multiple types of high dimensional omics data. Such genetic regulatory networks 
are usually intrinsically sparse and harbor hub structures. Identifying the hub regula- 
tors (master regulators) is of particular interest, as they play crucial roles in shaping 
network functionality. To tackle these challenges, remMap utilizes a MAP penalty, which 
consists of an ii norm part for controlling the overall sparsity of the network, and 
an £2 norm part for further imposing a row-sparsity of the coefficient matrix, which 
facilitates the detection of master predictors (regulators). This combined regulariza- 
tion takes into account both model interpretability and computational tractability. 
Since the MAP penalty is imposed on the coefficient matrix as a whole, it helps to 
borrow information across different regressions and thus incorporates the relation- 
ships among response variables into the model. As illustrated in Section 3, this type 
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of "joint" modeling greatly improves model efficiency. Also, the combined ii and £2 
norm penalty further enhances the performance on both edge detection and master 
predictor identification. 

We also propose a cv.vote procedure to make better use of the cross validation 
results. The idea is to treat the training data from each cross-validation fold as a 
"bootstrap" sample and identify the variables consistently selected in the majority 
of the cross-validation folds. As suggested by the simulation study, this procedure is 
very effective in decreasing the number of false positives while only slightly increases 
the number of false negatives. Moreover, cv.vote can be applied to a broad range of 
model selection problems when cross validation is employed. 

In the real application, we apply the remMap method on a breast cancer data set. 
Our goal is to investigate the influences of DNA copy number alterations on RNA 
transcript levels based on 172 breast cancer tumor samples. The resulting model sug- 
gests the existence of a trans-hub region on cytoband 17ql2-q21, whose amplification 
influences RNA levels of 31 unlinked genes. Cytoband 17ql2-q21 is a well known 
hot region for breast cancer, which harbors the oncogene ERBB2. The above results 
suggest that this region may also harbor important transcriptional factors. While 
our findings are intriguing, clearly additional investigation is warranted. One way to 
verify the above conjecture is through a sequence analysis to search for common mo- 
tifs in the upstream regions of the 31 RNA transcripts, which remains as our future 
work. 

Besides the above application, the remMap model can be applied to investigate the 
regulatory relationships between other types of biological molecules. For example, 
it is of great interest to understand the influence of single nucleotide polymorphism 
(SNP) on RNA transcript levels, as well as the influence of RNA transcript levels 
on protein expression levels. Such investigation will improve our understanding of 
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related biological systems as well as disease pathology. In addition, we can utilize 
the remMap idea to other models. For example, when selecting a group of variables in 
a multiple regression model, we can impose both the £2 penalty (that is, the group 
lasso penalty), as well as an ii penalty to encourage within group sparsity. Similarly, 
the remMap idea can also be applied to vector autoregressive models and generalize 
linear models. 

R package remMap is public available through CRAN {http : / / cr an. r— project.org/). 
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Table 4: RNAs^ being influenced by the ampliflcations of the three CNAIs in Table |3l 



Clone ID 


Gene symbol 


Cytoband 


Correlation 


753692 


ABLIMI 


10q25 


0.199 


896962 


ACADS 


12q22-qter 


-0.22 


753400 


ACTL6A 


3q26.33 


0.155 


472185 


ADAMTSl 


21q21.2 


0.214 


210687 


AGTRl 


3q21-q25 


-0.182 


856519 


ALDH3A2 


17pll.2 


-0.244 


270535 


BM466581 


19 


0.03 


238907 


CABCl 


lq42.13 


-0.174 


773301 


CDH3 


16q22.1 


0.118 


505576 


CORIN 


4pl3-pl2 


0.196 


223350 


CP 


3q23-q25 


0.184 


810463 


DHRS7B 


17pl2 


-0.151 


50582 


FLJ25076 


5pl5.31 


0.086 


669443 


HSF2 


6q22.31 


0.207 


743220 


JMJD4 


lq42.13 


-0.19 


43977 


KIAA0182 


16q24.1 


0.259 


810891 


LAMA5 


20ql3.2-ql3.3 


0.269 


247230 


MARVELD2 


5ql3.2 


-0.214 


812088 


NLN 


5ql2.3 


0.093 


257197 


NRBF2 


10q21.2 


0.275 


782449 


PCBP2 


12ql3.12-ql3.13 


-0.079 


796398 


PEG3 


19ql3.4 


0.169 


293950 


P1P5K1A 


Iq22-q24 


-0.242 


128302 


PTMS 


12pl3 


-0.248 


146123 


PTPRK 


6q22.2-q22.3 


0.218 


811066 


T~-V TV TT 1 J -1 

RNF41 


12ql3.2 


-0.247 


773344 


SLC16A2 


Xql3.2 


0.24 


1031045 


SLC4A3 


2q36 


0.179 


141972 


STT3A 


llq23.3 


0.182 


454083 


i MrU 


l/q/2 


0.175 


825451 


USOl 


4q21.1 


0.204 


68400 


BM455010 


17 


0.748 


756253,365147 


ERBB2 


17qll.2-ql2— 17q21.1 


0.589 


510318,236059 


GRB7 


17ql2 


0.675 


245198 


MED24 


17q21.1 


0.367 


825577 


STARD3 


17qll-ql2 


0.664 


782756^ 


TBPLl 


6q22.1-q22.3 


0.658 



1. The first part of the table hsts the inferred trans-regulated genes. The second 
part of the table lists cis-regulated genes. 

2. This cDNA sequence probe is annotated with TBPLl, but actually 
maps to one of the 17q21.2 genes. 
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(b) Impact of predictor and response dimensionality P {Q 
Pe ~ 0; the total number of trans-edges is 132. 
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Figure 1: Impact of signal size and dimensionality. Heights of solid bars represent numbers 
of false positive detections of trans-edges (FP); heights of shaded bars represent numbers 
of false negative detections of trans-edges (FN). All bars are truncated at height=132. 
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Figure 2: Impact of correlations. Heights of solid bars represent numbers of false positive 
detections of trans-edges (FP); heights of shaded bars represent numbers of false negative 
detections of trans-edges (FN). All bars are truncated at height=132. 
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(a) r CNAI A 



GeneB 




(b) r CNAI A 



GeneC 



GeneB 



Figure 3: (a) Direct interaction between CNAI A and the expression of gene B; 
(b) indirect interaction between CNAI A and the expression of Gene B through one 
intermediate gene. 




Figure 4: Network of the estimated regulatory relationships between the copy numbers 
of the 384 CNAIs and the expressions of the 654 breast cancer related genes. Each 
blue node stands for one CNAI, and each green node stands for one gene. Red edges 
represent inferred trans- regulations (43 in total). Grey edges represent cis-regulations. 
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Supplementary Material 



Appendix A: Proof of Theorem [T] 

Define 



^ Q Q 



q=l q=l 



q=l 



It is obvious that, in order to prove Theorem [T|, we only need to show that, the 
solution of min^ F, X), is given by (for g = 1, ■ ■ ■ , Q) 
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0, if ||3^^^^°||2 = 

fl - „a ^'m , otherwise. 



where 



In the following, for function L, view : 7^ Q"} as fixed. With a slight abuse of 
notation, write L = L{[3q). Then when /S^ > 0, we have 



dPq ' m 



Thus, ^ > if and only if l3q > jSq , where 



2 



1 



Denote the minima of L{/3g)\f3^>o by /Sg^min- Then, when Pg'^ > 0, /S+j^j^^ = . 
On the other hand, when < 0, /Sg'^min — 0- Note that Pg'^ > if and only if 
xyg{l — ^) > 0. Thus we have 

^ P+, if xyg{l - > 0; 



0^ ■ 



0' if^?/.(i-;^)<0- 



Similarly, denote the minima of L{Pg)\n<Q by /3„min, and define 



Then we have 



Pg-:=^^{1 + ^) 



/^o.mi 



if^?/.(i + 4)<0; 



5, mm 



0, if^?/.(l + ^)>0. 



Denote the minima of L{Pg) as Pg (with a slight abuse of notation). From the above, it 
is obvious that, if xyg > 0, then Pg > 0. Thus Pg = max(/3+, 0) = ^ '^^1^ (1 — ^)+ = 
2^^A2 (-^~];^)+- Similarly, iixyg < 0, then /3q < 0, and it has the same expression as 
above. Denote the minima of L(/5)|p||2>o (now viewed as a function of {Pi, ■ ■ ■ , Pq)) 
as Paiin = {Pi,mm, ' ' ' ,PQ,min)- We have showu above that, if such a minima exists, it 
satisfies (for q = 1, ■ ■ ■ ,Q) 



f, ^1 ^ _ Slasso X 



„2 



iPminlb ^ llPminlb 

where p[^^° is defined by equation (IS-ll) . Thus 



I \Pmm\ I2 — 11/5 



Slasso I I 



'^2;2 J h. 



Il/3mi„l|2 



2 



By solving the above equation, we obtain 



ll/9min||2=||/5 



Slasso II "^2 



2 



X 



2 ' 



By plugging the expression on the right hand side into (lS-21) . we achieve 

3 . = ^asso [ 1 ^2 

yq,mm Hq \ ^ , , o 

Denote the minima of L{(3) by /3 = ■ ■ ■ , /3q). From the above, we also know that 
if 1 I2 — ^ > 0, L{/3) achieves its minimum on ||/9||2 > 0, which is f3 = /^min- 
Otherwise, L{(3) achieves its minimum at zero. Since 1 1/^'''™! I2 - ^ > if and only if 
1 — — o > 0, we have proved the theorem. 



Appendix B: BIC criterion for tuning 

In this section, we describe the BIC criterion for selecting (Ai, A2). We also derive an 
unbiased estimator of the degrees of freedom of the remMap estimator under orthogonal 
design. 

In model ([I]), by assuming ~ Normal (0, o"^^), the BIC criterion for the q^^ 
regression can be defined as 

BICg(A„ ■ ■ ■ , ^pq-, dfg) = Nx log(RSS,) + logiV X df„ (S-3) 

where RSSg := Yln=iiyq ~ V^Y "^i^^ = Yll=i^pPpq\ and dfg is the degrees of 
freedom which is defined as 

TV 

df, = df,(A„ ■■■3pq):=Y. Cov(y^, (S-4) 

n=l 



3 



where cr^ is the variance of e^. 



For a given pair of (Ai, A2), We then define the (overall) BIC criterion at (Ai, A2): 

Q Q 
BIC(Ai, A2) = iV X ^ log(RSS,(Ai, A2)) + log iV x ^ df,(Ai, A2). (S-5) 

q=l q=l 



Efron et al. (2004) derive an explicit formula for the degrees of freedom of lars un- 



der orthogonal design. Similar strategy are also used by Yuan and Lin (2006) among 
others. In the following theorem, we follow the same idea and derive an unbiased 
estimator of dfq for remMap when the columns of X are orthogonal to each other. 

Theorem 2 Suppose XjXp/ = for all 1 <p ^ p' < P . Then for given (Ai, A2), 

Al 



C?/g(Al, A2) 



p=l ^ II Pll2/ V 



12 
pl I2 



X 



A2 1 


1 R lasso 
l^p 


1 12 
\\2,C 




\-^p\ 


to to 


D lasso 1 
P 1 


\2,C 



p 

5^(1 -c,,,) (S-6) 
p=i 



is an unbiased estimator of the degrees of freedom dfg(Ai,A2) (defined in equation 
i S-4\ )) of the remMap estimator B = B(Ai,A2) = {/3pq{Xi , X2)) ■ Here, under the 
orthogonal design, l3pq,P^^^° are given by Theorem IJ\ with Yg = Yg (q = I,-- - ,Q), 



XpYa 

PQ Tix;ii2 



and — 



Before proving Theorem [2], we first explain definition (1S-4P - the degrees of free- 
dom. Consider the g*'^ regression in model ([T]). Suppose that {y^}n=i are the fitted 
values by a certain fitting procedure based on the current observations {y^ : n = 
I,-- - ,N;q = 1, ■ ■ ■ ,Q}. Let fig := Yl^=i^pl^pq- Then for a fixed design matrix 
X = (Xp), the expected re-scaled prediction error of {y"}^^^ in predicting a future 



4 



set of new observations {yq}n=i from the g*'' regression of model ([T]) is: 

N N 

PE, = 5^ E{{y^ - y^r)/al = ^ - + N. 



n=l n=l 



Note that 



Therefore, 

PE, = ^ - y-f)/al + 2 Cov(y^, y-)/al. 

n=l n=l 

Denote RSSg = — VgY- Then an un-biased estimator of PE^ is 



N 
n=l 



Therefore, a natural definition of the degrees of freedom for the procedure resulting 
the fitted values {y^}^^i is as given in equation flS-4p . Note that, this is the definition 



used in Mallow's Cp criterion. 

Proof of Theorem (2) By applying Stein's identity to the Normal distribution, we 
have: if Z ~ N{fi,a'^), and a function g such that E{\g'{Z)\) < oo, then 

Coy{g{Z),Z)/a' = E{g'{Z)). 

Therefore, under the normality assumption on the residuals {e^}^^^ in model ([I]), 
definition (lS-41) becomes 



N 

I ov: \ 

I,--- 



5 



Thus an obvious unbiased estimator of dfg is df^ = J2n=i a^- following, 
we derive dfg for the proposed remMap estimator under the orthogonal design. Let 
Pq = [Piq, ■ ■ ■ ,l3pq) be a one by P row vector; let X = (x^) be the by P design 



matrix which is orthonormal; let Yq 



be by one column vectors. Then 



.y^r andF, 



df. 



tr X 



dY, 



tr X 



#g,ois 9Yq 



where /3g,ois = {(^ig-, ■ ■ ■ , Ppq) and the last equality is due to the chain rule. Since 



under the orthogonal design, /5' 



ols 
pq 



XpyjUpWl where 



X. 



N\T 
p ) ' 



thus 



9A 



'q,ols 



dYa 



DX^, where D is a P by P diagonal matrix with the p*^ diagonal entry 



being l/||Xp||2. Therefore 



rf/ = tr I x4^DX^ "1 = tr f DX^X- 



dPq^ols 



dp, 



tr 



(J, ols 



d(3q 

dPq^ols 



E 



da 



pq 



PQ 



where the second to last equality is by X"^X = D ^ which is due to the orthogonality 
of X. By the chain rule 

dPpq 

By Theorem 1, under the orthogonal design. 



pq 



I olasso 1 1 ^ 
l^p l|2,C> 



A, 



l^pl I2 



X 



A, 



IX 



p\ I2 



IP 



lasso I 1 2 
P 



/'^asso\2 

2,C \ypq ) 



2,C 



and 



if Cp,g = 0; 



if c 



P,Q 



1. 



Note that when Cp^g = 0, = (3°]^, thus = 1. It is then easy to show that df 
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is as given in equation (lS-6p . 

Note that, when the £2 penalty parameter A2 is 0, the model becomes q separate 
lasso regressions with the same penalty parameter Ai and the degrees of freedom 
estimation in equation flS-6p is simply the total number of non-zero coefficients in 
the model (under orthogonal design). When A2 is nonzero, the degrees of freedom of 
remMap estimator should be smaller than the number of non-zero coefficients due to 
the additional shrinkage induced by the £2 norm part of the MAP penalty (equation 
(3)). This is reflected by equation (]S-6p . 



Appendix C: Data Preprocessing 
C.l Preprocessing for array CGH data 

Each array output {log2 ratios) is flrst standardized to have median= and smoothed 



by cghFLasso (Tibshirani and Wang 2008) for deflning gained/lost regions on the 



genome. The noise level of each array is then calculated based on the measurements 
from the estimated normal regions (i.e., regions with estimated copy numbers equal 
to 2). After that, each smoothed array is normalized according to its own noise level. 
We deflne copy number alteration intervals (CNAIs) by using the Fixed-Order 



Clustering (FOC) method ( Wang 2004 ). FOC flrst builds a hierarchical clustering 
tree along the genome based on all arrays, and then cuts the tree at an appropriate 
height such that genes with similar copy numbers fall into the same CNAI. FOC 
is a generalization of the CLAC (CLuster Along Chromosome) method proposed by 



Wang et al. (2005) It differs in two ways from the standard agglomerative clustering. 
First, the order of the leaves in the tree is flxed, which represents the genome order 
of the genes/clones in the array. So, only adjacent clusters are joined together when 
the tree is generated by a bottom-up approach. Second, the similarity between two 
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clusters no longer refers to the spatial distance but to the similarity of the array 
measurements {log2 ratio) between the two clusters. By using FOC, the whole genome 
is divided into 384 non-overlapping CNAIs based on all 172 CGH arrays. This is 
illustrated in Figure IS-ll In addition, the heatmap of the (sample) correlations of the 
CNAIs is given in Figure S-2. 



C.2 Selection of breast cancer related genes 



We combine seven published breast cancer gene lists: the intrinsic gene set ( ISorlie et al. 20031) . 



the Amsterdam 70 gene ( van de Vijver et al. 2002 ), the wound response gene set 
( Chang et al. 2004 ), the 76 genes for the metastasis signature ( Wang et al. 2005 ), the 
gene list for calculating the recurrence score (IPaik et al. 20041) , the gene list of the Ge- 
nomic Grade Index (GGI) (ISotiriou et al. 20061) . and the PTEN gene set f lSaal et al. 2"007|) 
Among this union set of breast cancer related genes, 967 overlap with the genes in the 
current expression data set. We further filter away genes with missing measurements 
in more than 20% of the samples, and 654 genes are left. Among these 654 selected 
genes, 449 are from the intrinsic gene set (ISorlie et al. 20031) . which are used to derive 
breast cancer subtype labels in Appendix C.4. 



C.3 Interactions among RNA expressions 



We apply the space (Sparse PArtial Correlation Estimation) method ( Peng et al. 2008 ) 
to infer the interactions among RNA levels through identifying non-zero partial cor- 
relations, space assumes the overall sparsity of the partial correlation matrix and 
employs sparse regression techniques for model fitting. As indicated by many exper- 
iments that genetic-regulatory networks have a power-law type degree distribution 
with a power parameter between 2 and 3 (INewman 20030 . the tuning parameter in 
space is chosen such that the resulting network has an estimated power parameter 
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around 2 (see Figure S-3(b) for the corresponding degree distribution). The resulting 
(concentration) network has 664 edges in total, whose topology is illustrated in Fig- 
ure S-3(a). In this network, there are 7 nodes having at least 10 edges. These hub 
genes include PLKl, PTTGl, AURKA, ESRl, and GATA3. PLKl has important 
functions in maintaining genome stability via its role in mitosis. Its over expression 
is associated with preinvasive in situ carcinomas of the breast fIRizki et al. 2007p . 



PTTGl is observed to be a proliferation marker in invasive ductal breast carcino- 
mas (ITalvinen et al. 20080 . AURKA encodes a cell cycle-regulated kinase and is 
a potential metastasis promoting gene for breast cancer (IThomassen et al. 20081) . 



ESRl encodes an estrogen receptor, and is a well known key player in breast cancer. 
Moreover, it had been reported that GATA3 expression has a strong association with 
estrogen receptor in breast cancer (IVoduc et al. 2008]) . Detailed annotation of these 



and other hub genes are listed in Table [S^ We refer to this network as Exp. Net. 664, 
and use it to account for RNA interactions when investigating the regulations between 
CNAIs and RNA levels. 

Table S-1: Annotations for hub genes (degrees greater than 10) in the inferred RNA 
interaction network Exp. Net. 664- 



ClonelD 


Gene Name 


Symbol 


ID 


Cytoband 


744047 


Polo-like kinase 1 (Drosophila) 


PLKl 


5347 


16pl2.1 


781089 


Pituitary tumor-transforming 1 


PTTGl 


9232 


5q35.1 


129865 


Aurora kinase A 


AURKA 


6790 


20ql3.2-ql3.3 


214068 


GATA binding protein 3 


GATA3 


2625 


10pl5 


950690 


Cyclin A2 


CCNA2 


890 


4q25-q31 


120881 


RAB31, member RAS oncogene family 


RAB31 


11031 


18pll.3 


725321 


Estrogen receptor 1 


ESRl 


2099 


6q25.1 



C.4 Breast Cancer Subtypes 

Population stratification due to distinct subtypes could confound our detection of 
associations between CNAIs and gene expressions. For example, if the copy number 

9 



of CNAI A and expression level of gene B are both higher in one subtype than in 
the other subtypes, we could observe a strong correlation between CNAI A and gene 
expression B across the whole population, even when the correlation within each sub- 
type is rather weak. To account for this potential confounding factor, we introduce 
a set of subtype indicator variables, which is used as additional predictors in the 
remMap model. Specifically, we derive subtype labels based on expression patterns by 



following the work of Sorlie et al. (2003) We first normalize the expression levels of 
each intrinsic gene (449 in total) across the 172 samples to have mean zero and MAD 
one. Then we use kmeans clustering to divide the patients into five distinct groups. 



which correspond to the five subtypes suggested by Sorlie et al. (2003) — Luminal 
Subtype A, Luminal Subtype B, ERBB2-overexpressing Subtype, Basal Subtype and 
Normal Breast-like Subtype. Figure IS-41 illustrates the expression patterns of these 
five subtypes across the 172 samples. We then define five dummy variables to repre- 
sent the subtype information for each tumor sample, and include them as predictors 
when fitting the remMap model. 



C.5 Comments on the results of the remMap analysis. 

remMap analysis suggests that the amplification of a trans-hub region on chromosome 
17 influences the RNA expression levels of 31 distinct (unlinked) genes. Some of 
the 31 genes/clones have been reported to have functions directly related to can- 
cer and may serve as potential drug targets. For example, AGTRl is a receptor 
whose genetic polymorphisms have been reported to associate with breast cancer 
risk and is possibly druggable (IKoh et al. 2005( 1. CDH3 encodes a cell-cell adhe- 
sion glycoprotein and is deemed as a candidate of tumor suppressor gene, as dis- 
turbance of intracellular adhesion is important for invasion and metastasis of tumor 
cells f lKremmidiotis et al. 1998j) . PEG3 is a mediator between p53 and Bax in DNA 
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damage-induced neuronal death (IJohnson et al. 20020 and may function as a tumor 



suppressor gene (Dowdy et al. 2005). In a word, these 31 genes may play functional 
roles in the pathogenesis of breast cancer and may serve as additional targets for 
therapy. 
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Figure S-1: Hierarchical tree constructed by FOC. Each leaf represents one gene/clone 
on the array. The order of the leaves represents the order of genes on the genome. 
The 23 Chromosomes are illustrated with different colors. Cutting the tree at 0.04 
(horizonal red line) separates the genome into 384 intervals. This cutoff point is 
chosen such that no interval contains genes from different chromosomes. 
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Figure S-2: Heatmaps of the sample correlations among predictors. Top panel: sim- 
ulated data; Bottom panel: real data 
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(a) Exp. Net. 664'. Inferred network for the 654 breast cancer related 
genes (based on their expression levels) by space. Nodes with 
degrees greater than ten are drawn in blue. 
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Figure S-3: Inferred RNA interaction network. 
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Figure S-4: Heatmap showing the expressions of the 449 intrinsic genes in the 172 
breast cancer tumor samples. Each column represents one sample and each row 
represents one gene. The 172 samples are divided into 5 clusters (subtypes). From 
the left to the right, the 5 subtypes are: Luminal Subtype A, Luminal Subtype B, 
ERBB2-overexpressing subtype, Basal Subtype and Normal Breast-like Subtype. 
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